1. INTRODUCTION

E-commerce platforms have become a major channel for promoting products and services.
E-commerce covers a wide range of interactions between market participants, from ordering and delivering
products to the issuing of invoices by sellers and the making of payments by users [1]. Its rapid development
is driven mainly by the availability of diverse products and the ease of transferring money over the
internet [2]. Large inventories of many kinds of products are sold through online store websites
such as Walmart, Amazon, Alipay and eBay, where users can view and buy new products at any time.
Most of these websites are well structured and provide product information such as the product name,
description, price and image. The frequent addition of new products to e-commerce websites makes it
important to classify a product title so that sellers can list an item in a suitable category.
Some systems perform the classification directly after obtaining the product title. Most text mining
work deals with collections of lengthy documents, whereas a product title is normally a short text.
Titles differ from ordinary text in several respects: each title is very short, title lengths follow a
similar distribution, and the grammatical structure is mostly incomplete [3].
Some studies have addressed product title classification, but they focused on identifying general
properties of product titles [3-4]. The most closely related, but broader, area is short-text
classification, which already has an extensive literature. It includes question classification [5-6],
semantic annotation classification [7] and job title classification [8]. However, question and semantic
annotation classification normally deal with complete sentences, and job title classification deals with
shorter texts than product titles.

In addition, several studies use other sources of information, such as the price of the product,
as additional features in order to increase the accuracy of the classification model [9-10].
This study, however, considers only the text of product titles. It was previously found that the
properties of product title classification differ from those of general text classification, in that
stemming and stop-word removal are not needed when dealing with product titles [3]. This study goes in the
same direction and shares the same spirit of keeping the model simple and interpretable. The difference is
that the previous studies focused on the properties of product titles, which involved transforming each
pair of word tokens from the text sequences into useful features. In this paper, the main purpose is to
identify the most suitable classification models for short-text data, in particular e-commerce product titles.
The classification of e-commerce products based on product titles can be done with a supervised learning
model. Such a model suits classification problems because the goal is for the computer to learn a
classification system that has already been created. Supervised learning models have been applied in many
fields, such as pattern recognition [11], natural language processing [12], market segmentation [13] and
bioinformatics [14]. Nevertheless, a comparison of well-known supervised learning models, including
Naive Bayes, K-Nearest Neighbor (KNN), Decision Tree, Support Vector Machine (SVM) and Random Forest,
has not yet been reported for product title classification. Evaluating the performance of each model is
crucial because the results indicate which model is best suited to this kind of data. Hence, this paper
compares the performance of Naive Bayes, KNN, Decision Tree, SVM and Random Forest in classifying
e-commerce product titles. The rest of this paper is organized as follows: Section 2 presents the research
methodology, including the description of the data sets and the research design; Section 3 shows the
results of comparing the supervised learning models on e-commerce product classification; Section 4
concludes the findings of the research.


2. RESEARCH METHOD 
2.1. Data description 

The Department of Statistics Malaysia (DOSM) collected product information from one of the major
online store websites through its STATSBDA project known as Price Intelligence (PI), using its prototype
web scraper. A few leaf nodes of the website's browse tree were used to represent the chosen categories.
Table 1 describes the two corpora selected for this research, the Fresh Food and Household data sets.
The five categories in the Fresh Food data set are fresh meat & poultry, fish & seafood, bakery,
fresh fruits, and noodles. The five categories in the Household data set are toilet cleaner,
air freshener, floor cleaners, light bulbs, and household sundries.


Table 1. Summary description of data sets 


Dataset       Categories   Instances   Number of features   Number of features after feature selection
Fresh Food    5            447         88                   78
Household     5            684         138                  116


2.2. Research design 

Several steps had to be completed before classifying the data, as presented in Figure 1: data
extraction, data pre-processing, feature extraction and feature selection. These are the basic
procedures in text classification research. Three pre-processing steps followed data extraction:
tokenization, stop word removal and stemming [15]. Data pre-processing is crucial to ensure that the
data are standardized and in a proper form. The product descriptions were first tokenized into words.
Then, stop words were removed from the word list and the remaining words were stemmed to reduce them to
their root forms. Feature extraction and selection are important to ensure that the data are transformed
into significant, informative features before classification [16], because the choice of features may
affect the accuracy of a classification model. Hence, this research used the bag-of-words representation
for feature extraction and the correlation feature selection technique for feature selection.
The selected features were then used as inputs to the different supervised learning models.
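
The pre-processing and bag-of-words steps described above can be illustrated with a minimal Python sketch. It is not the code used in the study (which was run in R): the stop-word list, the crude suffix-stripping `stem` function and the example titles are hypothetical stand-ins chosen only to keep the example self-contained.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real run would use a fuller one.
STOP_WORDS = {"a", "an", "and", "for", "of", "the", "with"}

def tokenize(title):
    """Lower-case a title and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", title.lower())

def stem(word):
    """Crude suffix stripping, a stand-in for a real stemmer such as Porter's."""
    for suffix in ("ies", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)] + ("y" if suffix == "ies" else "")
    return word

def preprocess(title):
    """Tokenize, drop stop words, then stem the remaining tokens."""
    return [stem(tok) for tok in tokenize(title) if tok not in STOP_WORDS]

def bag_of_words(titles):
    """Return (vocabulary, count matrix) with one row of term counts per title."""
    docs = [Counter(preprocess(t)) for t in titles]
    vocab = sorted(set().union(*docs))
    return vocab, [[doc[term] for term in vocab] for doc in docs]

# Made-up product titles, for illustration only.
titles = ["Fresh Apples and Oranges", "LED Light Bulbs for the Kitchen"]
vocab, matrix = bag_of_words(titles)
print(vocab)   # ['appl', 'bulb', 'fresh', 'kitchen', 'led', 'light', 'orang']
print(matrix)  # [[1, 0, 1, 0, 0, 0, 1], [0, 1, 0, 1, 1, 1, 0]]
```

Each row of `matrix` is the feature vector of one title; a feature selection step would then keep only a subset of the columns.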


[Figure: flowchart. Data Extraction -> Data Pre-Processing (Tokenization, Stop Word Removal, Stemming)
-> Feature Extraction -> Feature Selection -> Classification Using Supervised Learning Model
(Naive Bayes, KNN, Decision Tree, SVM, Random Forest)]


Figure 1. Flowchart of the Research


A supervised learning model makes predictions based on information about the targets and
the features of the data. It infers a function from a given set of input-output pairs.
Normally, the input data provide a set of observations with which the computer is trained [17].
Each observation consists of an input vector and a desired output value. The model is trained on these
data and produces a general rule or function that is used to predict or classify new inputs.
Five supervised learning models were used in this study: Naive Bayes, K-Nearest Neighbor (KNN),
Decision Tree, Support Vector Machine (SVM) and Random Forest. Naive Bayes is a classification model
based on Bayes' theorem, introduced by Thomas Bayes, and it has been used as a conventional paradigm
since the late 18th century [18]. It is a probabilistic classifier: it estimates the probability of each
class given the observation and chooses the class with the highest probability.
It is widely used in text categorization, sentiment analysis and spam filtering [15]. The algorithm for
Naive Bayes [19] is shown in Figure 2.


Dataset $D = \{x_1, \ldots, x_n\} \subset \mathbb{R}^d$


1st step:
    Assign all the words in D as the feature vector, V


2nd step:
    Train the model by,
    Repeat
        for each category $c_j \in C$,
            let $d_j$ be the subset of samples in D with category $c_j$
            $P(c_j) = |d_j| / |D|$
            let $t_j$ be the concatenation of all the samples in $d_j$ and $n_j$ be the total number of
            occurrences in $t_j$
            for each feature $x_i \in V$
                let $n_{ij}$ be the number of occurrences of $x_i$ in $t_j$
                let $P(x_i \mid c_j) = (n_{ij} + 1) / (n_j + |V|)$


3rd step:
    Given a test sample X, let n be the number of feature occurrences in X.
    Find the category by,

    $$\arg\max_{c_j \in C} \; P(c_j) \prod_{i=1}^{n} P(a_i \mid c_j)$$

    where $a_i$ is the feature occurring at the i-th position in X


Figure 2. Naive Bayes Algorithm
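

The training and prediction steps of Figure 2 can be sketched in Python as follows. This is an illustrative sketch rather than the implementation used in the study; the function names and the toy titles are hypothetical, and two choices are assumptions of the sketch: probabilities are summed as logarithms to avoid numerical underflow, and tokens never seen in training are skipped.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(samples, labels):
    """samples: list of token lists; labels: list of category names."""
    vocab = {tok for sample in samples for tok in sample}            # 1st step: V
    doc_counts = Counter(labels)                                     # |d_j|
    word_counts = defaultdict(Counter)                               # n_ij
    for sample, label in zip(samples, labels):
        word_counts[label].update(sample)
    priors = {c: doc_counts[c] / len(labels) for c in doc_counts}    # P(c_j)
    totals = {c: sum(word_counts[c].values()) for c in doc_counts}   # n_j
    return vocab, priors, word_counts, totals

def predict_naive_bayes(model, sample):
    """Return the category maximizing P(c_j) * prod_i P(a_i | c_j), in log space."""
    vocab, priors, word_counts, totals = model
    best_label, best_score = None, -math.inf
    for c, prior in priors.items():
        score = math.log(prior)
        for tok in sample:
            if tok in vocab:  # assumption: tokens unseen in training are skipped
                # Laplace-smoothed estimate (n_ij + 1) / (n_j + |V|)
                score += math.log((word_counts[c][tok] + 1) / (totals[c] + len(vocab)))
        if score > best_score:
            best_label, best_score = c, score
    return best_label

# Toy, made-up training titles (already tokenized).
samples = [["fresh", "apple"], ["fresh", "fish"], ["floor", "cleaner"], ["toilet", "cleaner"]]
labels = ["food", "food", "household", "household"]
model = train_naive_bayes(samples, labels)
print(predict_naive_bayes(model, ["fresh", "banana"]))  # food
```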


KNN is a fundamental classification model that classifies observations according to the closest
training examples in the feature space, and it is suited to cases where there is little or no prior
knowledge about the distribution of the data [20]. It is an instance-based learning method: the target
function is only approximated locally and all computation is deferred until classification. The rule
simply stores the training set during the learning process. Each observation is then assigned to a class
according to the majority label of its k nearest neighbors in the training data set. The idea is that a
sample should be grouped with the similar samples that surround it, so the nearest neighbor samples can be
used to classify or predict an unknown sample. The algorithm for KNN [21] is shown in Figure 3.


Dataset $D = \{x_1, \ldots, x_n\}$ with training categories $c_1, \ldots, c_j$; the total number of training samples is N


1st step:
    Repeat
        for each input vector $x = (x_1, \ldots, x_m)$
        do
            Calculate the similarities between x and the training samples.
            e.g. for the i-th sample $d_i = (d_{i1}, \ldots, d_{im})$, the similarity $SIM(x, d_i)$ can be represented by,

            $$SIM(x, d_i) = \frac{\sum_{j=1}^{m} x_j \, d_{ij}}{\sqrt{\sum_{j=1}^{m} x_j^2} \; \sqrt{\sum_{j=1}^{m} d_{ij}^2}}$$


2nd step:
    Choose the k samples with the largest of the N similarities $SIM(x, d_i)$, $(i = 1, 2, \ldots, N)$, and
    take them as the KNN collection of x.


3rd step:
    Calculate the probability of x belonging to each category using,

    $$P(x, c_j) = \sum_{d_i \in KNN(x)} SIM(x, d_i) \, y(d_i, c_j)$$

    where $y(d_i, c_j)$ is the category attribute function that satisfies,

    $$y(d_i, c_j) = \begin{cases} 1, & d_i \in c_j \\ 0, & d_i \notin c_j \end{cases}$$


4th step:
    Assign the sample to the category that has the largest $P(x, c_j)$


Figure 3. K-Nearest Neighbor (KNN) Algorithm
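

As a hedged illustration of Figure 3, the sketch below scores each category by the summed cosine similarity of the k most similar training vectors. The function names, the default `k=3` and the toy vectors are assumptions made for the example and do not come from the study.

```python
import math
from collections import defaultdict

def cosine_similarity(x, d):
    """SIM(x, d): cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, d))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in d))
    return dot / norm if norm else 0.0

def knn_predict(train_vectors, train_labels, x, k=3):
    # 1st step: similarity between x and every training sample
    sims = [(cosine_similarity(x, d), label)
            for d, label in zip(train_vectors, train_labels)]
    # 2nd step: keep the k most similar samples
    neighbours = sorted(sims, key=lambda pair: pair[0], reverse=True)[:k]
    # 3rd step: P(x, c_j) = sum of similarities of the neighbours in category c_j
    scores = defaultdict(float)
    for sim, label in neighbours:
        scores[label] += sim
    # 4th step: category with the largest score
    return max(scores, key=scores.get)

# Toy bag-of-words vectors over a hypothetical vocabulary [fresh, apple, cleaner, floor].
train_vectors = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 0, 1, 0]]
train_labels = ["food", "food", "household", "household"]
print(knn_predict(train_vectors, train_labels, [1, 0, 0, 0]))  # food
```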


Decision Tree is a model with a flowchart-like structure. It is made up of a tree and a set of rules
that represent each of the classes in a data set. A Decision Tree consists of three main elements,
internal nodes, branches and leaf nodes, which represent a test on an attribute, an outcome of the test
and a class label respectively [22]. Figure 4 shows the algorithm for Decision Tree [15].


D = Training Dataset, M = Input Attributes and N = Target Attribute


1st step:
    Create a tree, T = TreeGrowing(D, M, N)
    if one of the stopping criteria is fulfilled then
        mark the root node in T as a leaf with the most common value of N in D as the class
    else
        find a discrete function f(M) of the input attribute values such that splitting
        D according to the outcomes of f(M) gains the best splitting metric
        if the best splitting metric > threshold then
            label the root node of T as f(M)
            for each outcome $v_i$ of f(M) do
                $subtree_i$ = TreeGrowing($\sigma_{f(M) = v_i} D$, M, N)
                connect the root node of T to $subtree_i$ with an edge that is labelled $v_i$
            end for
        else
            mark the root node in T as a leaf with the most common value of N in D as the class
        end if
    end if
    The returned output is the value of T


2nd step:
    Prune the tree, T = TreePruning(D, T, N)
    repeat
        select a node t in T such that pruning it maximally improves some evaluation criterion
        if $t \neq \emptyset$ then
            T = pruned(T, t)
        end if
    until $t = \emptyset$
    The returned output is the value of the current T


Figure 4. Decision Tree Algorithm
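

The growing step of Figure 4 can be sketched in Python as below. The sketch makes several assumptions that the paper does not specify: information gain is used as the splitting metric, each attribute is tested at most once per path, and the pruning step is omitted. All names and the toy data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Entropy reduction obtained by splitting on the values of one attribute."""
    remainder = 0.0
    for value in set(row[attr] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder

def grow_tree(rows, labels, attrs, threshold=0.0):
    majority = Counter(labels).most_common(1)[0][0]
    # Stopping criteria: the node is pure or no attributes are left.
    if len(set(labels)) == 1 or not attrs:
        return majority
    best = max(attrs, key=lambda a: information_gain(rows, labels, a))
    if information_gain(rows, labels, best) <= threshold:
        return majority                      # best metric does not exceed threshold
    node = {"attr": best, "default": majority, "children": {}}
    remaining = [a for a in attrs if a != best]
    for value in set(row[best] for row in rows):
        keep = [i for i, row in enumerate(rows) if row[best] == value]
        node["children"][value] = grow_tree(
            [rows[i] for i in keep], [labels[i] for i in keep], remaining, threshold)
    return node

def classify(tree, row):
    """Follow the branches from the root until a leaf (a class label) is reached."""
    while isinstance(tree, dict):
        tree = tree["children"].get(row[tree["attr"]], tree["default"])
    return tree

# Toy binary features: column 0 = title contains "fresh", column 1 = contains "pack".
rows = [[1, 0], [1, 1], [0, 1], [0, 0]]
labels = ["food", "food", "household", "household"]
tree = grow_tree(rows, labels, attrs=[0, 1])
print(classify(tree, [1, 1]))  # food
```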
Support Vector Machine (SVM) is commonly used for classification and was introduced in [23].
It works by computing the margins between the classes. The separating boundary is drawn so that the
distance between it and the nearest points of each class is a maximum, which minimizes the
classification error. SVM has been applied in various fields, such as gene expression, text
classification and image identification [24]. The model is considered to give good generalization
accuracy, but its training requires solving a quadratic optimization problem with bound constraints and
a linear equality constraint. The algorithm for SVM [25] is shown in Figure 5.


Dataset $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, where $x_i$ is an n-dimensional vector and
$y_i \in \{1, -1\}$ denotes the class to which the point $x_i$ belongs.


1st step:
    Train the model using the function
    $$f(x) = v \cdot x - b$$
    where v is the weight vector and b is the bias. The condition that needs to be satisfied is,
    $$y_i (v \cdot x_i - b) > 0, \quad \forall (x_i, y_i) \in D$$


2nd step:
    Maximize the margin, which is the distance from the hyperplane to the closest data points.
    The distance is formulated as,
    $$\text{distance} = \frac{|f(x_i)|}{\|v\|}$$
    Hence, the margin can be written as,
    $$\text{margin} = \frac{1}{\|v\|}$$

    The training problem is presented by,
    $$\text{minimize: } Q(v) = \frac{1}{2} \|v\|^2$$
    $$\text{subject to: } y_i (v \cdot x_i - b) \geq 1, \quad \forall (x_i, y_i) \in D$$


Figure 5. Support Vector Machine (SVM) Algorithm
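

For illustration only, the sketch below trains a linear classifier of the form $f(x) = v \cdot x - b$ on a toy data set. It does not solve the constrained quadratic problem of Figure 5 directly: as a simpler swapped-in technique it applies subgradient descent to the equivalent soft-margin (hinge loss) objective. The parameters `lam`, `epochs` and `lr` and the toy points are hypothetical.

```python
def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.1):
    """X: list of feature vectors; y: labels in {+1, -1}. Returns (v, b)."""
    v = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(vj * xj for vj, xj in zip(v, xi)) - b)
            # Regularization step: shrinking v corresponds to widening the margin 1/||v||.
            v = [vj - lr * lam * vj for vj in v]
            if margin < 1:  # the constraint y_i (v.x_i - b) >= 1 is violated
                v = [vj + lr * yi * xj for vj, xj in zip(v, xi)]
                b -= lr * yi
    return v, b

def svm_predict(v, b, x):
    """Sign of f(x) = v.x - b, returned as +1 or -1."""
    return 1 if sum(vj * xj for vj, xj in zip(v, x)) - b >= 0 else -1

# Toy, linearly separable points (made up for the example).
X = [[2, 2], [3, 3], [2, 3], [-2, -2], [-3, -3], [-2, -3]]
y = [1, 1, 1, -1, -1, -1]
v, b = train_linear_svm(X, y)
print([svm_predict(v, b, x) for x in X])
```

A multi-class problem such as the five product categories would need an additional scheme, for example one classifier per category; that extension is outside this sketch.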


Random Forest is an ensemble of decision trees. Figure 6 shows the algorithm for
Random Forest [26]. It consists of a collection of tree-structured classifiers, each of which is grown
from an independent, identically distributed random vector. The algorithm can maintain its
performance even when the data contain a large proportion of missing values [27]. All the steps were
computed using the R programming software.


N = Number of nodes
M = Number of features
D = Number of trees to be constructed


1st step:
    Randomly draw a bootstrap sample A from the training data


2nd step:
    Construct tree $T_i$ from the drawn bootstrap sample A using,
        i.   Randomly select m features from M, where m < M
        ii.  Calculate the best split point among the m features for node d
        iii. Split the node into two daughter nodes using the best split
        iv.  Repeat i to iii until N nodes have been reached

    The forest is built by repeating steps i to iv D times


3rd step:
    Output all the constructed trees $\{T_i\}_{i=1}^{D}$ and apply a new sample to each of the
    constructed trees, starting from the root node


4th step:
    Assign the sample to the class according to the leaf node
    Combine the decisions of all trees and take the class with the most votes as the class of the sample


Figure 6. Random Forest Algorithm
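

A compact Python sketch of the procedure in Figure 6 is given below, under several assumptions that are not taken from the paper: features are binary, the Gini index is the split criterion, tree growth is limited by a hypothetical `max_depth` rather than a node count, and a fixed `seed` makes the random draws repeatable.

```python
import random
from collections import Counter

def gini(labels):
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def build_tree(rows, labels, m, rng, max_depth=5):
    """Grow one tree on binary (0/1) features, trying m random features per node."""
    if len(set(labels)) == 1 or max_depth == 0:
        return Counter(labels).most_common(1)[0][0]
    best = None
    for f in rng.sample(range(len(rows[0])), m):        # i. random feature subset
        left = [i for i, r in enumerate(rows) if r[f] == 0]
        right = [i for i, r in enumerate(rows) if r[f] == 1]
        if not left or not right:
            continue
        score = (len(left) * gini([labels[i] for i in left])
                 + len(right) * gini([labels[i] for i in right]))
        if best is None or score < best[0]:              # ii. best split so far
            best = (score, f, left, right)
    if best is None:
        return Counter(labels).most_common(1)[0][0]
    _, f, left, right = best
    # iii. split into two daughter nodes and recurse on each of them
    children = [build_tree([rows[i] for i in side], [labels[i] for i in side],
                           m, rng, max_depth - 1) for side in (left, right)]
    return (f, children[0], children[1])

def build_forest(rows, labels, n_trees, m, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]   # 1st step: bootstrap sample
        forest.append(build_tree([rows[i] for i in idx],
                                 [labels[i] for i in idx], m, rng))
    return forest

def forest_predict(forest, row):
    votes = Counter()
    for tree in forest:
        while isinstance(tree, tuple):                   # walk from the root to a leaf
            f, left, right = tree
            tree = right if row[f] == 1 else left
        votes[tree] += 1
    return votes.most_common(1)[0][0]                    # majority vote of all trees

# Toy data: four copies of one made-up feature pattern per category.
rows = [[1, 1, 0, 0]] * 4 + [[0, 0, 1, 1]] * 4
labels = ["food"] * 4 + ["household"] * 4
forest = build_forest(rows, labels, n_trees=15, m=2)
print(forest_predict(forest, [1, 1, 0, 0]))
```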
3. RESULTS AND DISCUSSION

The evaluation was carried out by observing the classification results of the five supervised
learning algorithms on two data sets, Fresh Food and Household, as shown in Table 2. On the Household
data, KNN achieved the highest accuracy (94.66%), with Random Forest close behind (93.69%);
Decision Tree reached 85.44%, while SVM and Naive Bayes stayed below 70%. The Fresh Food data led to a
broadly similar conclusion: KNN was again the most accurate model (82.96%), followed by
Random Forest (78.52%), and all other models scored below 70%. Naive Bayes was the worst classifier on
both data sets. The execution time of each classification model is shown in Table 3. Besides giving the
highest accuracy on both data sets, KNN was also the fastest classifier. Random Forest, although nearly
as accurate as KNN, was the slowest of the five algorithms to produce its classification results.


Table 2. Accuracy of classification models


Method           Household (%)   Fresh Food (%)
Naive Bayes      16.99           14.07
KNN              94.66           82.96
Decision Tree    85.44           67.41
SVM              69.42           44.44
Random Forest    93.69           78.52


Table 3. Execution time of classification models


Method           Household   Fresh Food
Naive Bayes      1.20        0.78
KNN              0.82        0.63
Decision Tree    0.94        0.66
SVM              0.99        0.65
Random Forest    1.51        [illegible]
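

The two quantities reported in the tables, accuracy and execution time, could be measured along the following lines. This is a generic sketch: the paper does not state how the time was measured or in which unit, so the use of `time.perf_counter` (seconds) and both function names are assumptions.

```python
import time

def accuracy(predicted, actual):
    """Percentage of predictions that match the true labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)

def timed_predictions(classify, samples):
    """Run classify on every sample; return (predictions, elapsed seconds)."""
    start = time.perf_counter()
    predictions = [classify(s) for s in samples]
    return predictions, time.perf_counter() - start

# Made-up labels: three of the four predictions are correct.
print(accuracy(["a", "b", "a", "b"], ["a", "b", "b", "b"]))  # 75.0
```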


The results show clearly that KNN outperformed the other supervised learning models.
Random Forest was not far behind KNN, but its weakness lies in its computation time.
These results are in line with a previous study [28], in which both models were preferable to other
supervised learning models on breast cancer data. Several other studies have also found KNN to be
superior in classifying different kinds of data [29-31]. Among the algorithms used in this study,
Naive Bayes performed worst. The performance of Naive Bayes has been shown to be affected by the
distribution of the data [32]. It normally performs well on real-world data whose nature changes easily
over time, whereas the data used in this study were independent and identically distributed.


4. CONCLUSION 

This paper presents a comparative evaluation of five well-known supervised learning algorithms
for the problem of e-commerce product title classification. On the whole, KNN performed best among the
five models. The simplicity of the model suits the task of classifying short texts such as
e-commerce product titles. The performance of KNN could be further improved by investigating the
optimal number of neighbors (K).
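
One possible way to investigate K is to compare candidate values by their leave-one-out accuracy on the training data. The sketch below is only an illustration of that idea: the choice of leave-one-out validation, the unweighted majority vote, the function names and the toy vectors are all assumptions and were not part of the study.

```python
import math
from collections import Counter

def cosine(x, d):
    dot = sum(a * b for a, b in zip(x, d))
    norm = math.sqrt(sum(a * a for a in x) * sum(b * b for b in d))
    return dot / norm if norm else 0.0

def knn_label(train, x, k):
    """train: list of (vector, label); majority label among the k most similar."""
    nearest = sorted(train, key=lambda pair: cosine(x, pair[0]), reverse=True)[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def loo_accuracy(data, k):
    """Leave-one-out accuracy of KNN with k neighbours on (vector, label) pairs."""
    hits = 0
    for i, (x, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]       # hold out sample i
        hits += knn_label(rest, x, k) == label
    return hits / len(data)

def best_k(data, candidates):
    """Candidate K with the highest leave-one-out accuracy (first one on ties)."""
    return max(candidates, key=lambda k: loo_accuracy(data, k))

# Toy, made-up feature vectors for two categories.
data = [([1, 1, 0], "food"), ([1, 0, 0], "food"), ([1, 1, 0], "food"),
        ([0, 0, 1], "household"), ([0, 1, 1], "household"), ([0, 0, 1], "household")]
print(best_k(data, [5, 1]))  # 1
```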